{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A taste of MLJ\n",
    "\n",
    "### Baby steps\n",
    "\n",
    "Let's load a reduced version of the well-known Ames House Price data set (containing six of the more important categorical features and six of the more important numerical features):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "using MLJ\n",
    "using DataFrames, Statistics\n",
    "\n",
    "task = load_reduced_ames()\n",
    "Xtable, y = task()\n",
    "\n",
    "train, test = partition(eachindex(y), 0.70, shuffle=true); # 70:30 split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Xtable` is a named-tuple of vectors but we can work with any tabular format supported by Tables.jl. Let's use `DataFrame`s:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>OverallQual</th><th>GrLivArea</th><th>Neighborhood</th><th>x1stFlrSF</th><th>TotalBsmtSF</th><th>BsmtFinSF1</th><th>LotArea</th><th>GarageCars</th><th>MSSubClass</th><th>GarageArea</th><th>YearRemodAdd</th><th>YearBuilt</th></tr><tr><th></th><th>Float64</th><th>Float64</th><th>Categorical…</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Float64</th><th>Categorical…</th><th>Float64</th><th>Float64</th><th>Float64</th></tr></thead><tbody><p>4 rows × 12 columns</p><tr><th>1</th><td>5.0</td><td>816.0</td><td>Mitchel</td><td>816.0</td><td>816.0</td><td>816.0</td><td>6600.0</td><td>2.0</td><td>_20</td><td>816.0</td><td>2003.0</td><td>1982.0</td></tr><tr><th>2</th><td>8.0</td><td>2028.0</td><td>Timber</td><td>2028.0</td><td>1868.0</td><td>1460.0</td><td>11443.0</td><td>3.0</td><td>_20</td><td>880.0</td><td>2006.0</td><td>2005.0</td></tr><tr><th>3</th><td>7.0</td><td>1509.0</td><td>Gilbert</td><td>807.0</td><td>783.0</td><td>0.0</td><td>7875.0</td><td>2.0</td><td>_60</td><td>393.0</td><td>2003.0</td><td>2003.0</td></tr><tr><th>4</th><td>6.0</td><td>1686.0</td><td>NWAmes</td><td>1686.0</td><td>1686.0</td><td>625.0</td><td>10240.0</td><td>2.0</td><td>_20</td><td>612.0</td><td>1980.0</td><td>1980.0</td></tr></tbody></table>"
      ],
      "text/latex": [
       "\\begin{tabular}{r|cccccccccccc}\n",
       "\t& OverallQual & GrLivArea & Neighborhood & x1stFlrSF & TotalBsmtSF & BsmtFinSF1 & LotArea & GarageCars & MSSubClass & GarageArea & YearRemodAdd & YearBuilt\\\\\n",
       "\t\\hline\n",
       "\t1 & 5.0 & 816.0 & Mitchel & 816.0 & 816.0 & 816.0 & 6600.0 & 2.0 & \\_20 & 816.0 & 2003.0 & 1982.0 \\\\\n",
       "\t2 & 8.0 & 2028.0 & Timber & 2028.0 & 1868.0 & 1460.0 & 11443.0 & 3.0 & \\_20 & 880.0 & 2006.0 & 2005.0 \\\\\n",
       "\t3 & 7.0 & 1509.0 & Gilbert & 807.0 & 783.0 & 0.0 & 7875.0 & 2.0 & \\_60 & 393.0 & 2003.0 & 2003.0 \\\\\n",
       "\t4 & 6.0 & 1686.0 & NWAmes & 1686.0 & 1686.0 & 625.0 & 10240.0 & 2.0 & \\_20 & 612.0 & 1980.0 & 1980.0 \\\\\n",
       "\\end{tabular}\n"
      ],
      "text/plain": [
       "4×12 DataFrame. Omitted printing of 7 columns\n",
       "│ Row │ OverallQual │ GrLivArea │ Neighborhood │ x1stFlrSF │ TotalBsmtSF │\n",
       "│     │ \u001b[90mFloat64\u001b[39m     │ \u001b[90mFloat64\u001b[39m   │ \u001b[90mCategorical…\u001b[39m │ \u001b[90mFloat64\u001b[39m   │ \u001b[90mFloat64\u001b[39m     │\n",
       "├─────┼─────────────┼───────────┼──────────────┼───────────┼─────────────┤\n",
       "│ 1   │ 5.0         │ 816.0     │ Mitchel      │ 816.0     │ 816.0       │\n",
       "│ 2   │ 8.0         │ 2028.0    │ Timber       │ 2028.0    │ 1868.0      │\n",
       "│ 3   │ 7.0         │ 1509.0    │ Gilbert      │ 807.0     │ 783.0       │\n",
       "│ 4   │ 6.0         │ 1686.0    │ NWAmes       │ 1686.0    │ 1686.0      │"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = DataFrame(Xtable);\n",
    "first(X, 4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A *model* is a container for hyperparameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "# \u001b[0m\u001b[1mConstantRegressor @ 1…36\u001b[22m: \n",
       "target_type             =>   Float64\n",
       "distribution_type       =>   Distributions.Normal{Float64}\n",
       "\n"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "constant_model = ConstantRegressor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wrapping the model in data creates a *machine* which will store training outcomes (called *fit-results*):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "# \u001b[0m\u001b[1mMachine{ConstantRegressor{Float6…} @ 4…98\u001b[22m: \n",
       "model                   =>   \u001b[0m\u001b[1mConstantRegressor @ 1…36\u001b[22m\n",
       "fitresult               =>   (undefined)\n",
       "cache                   =>   (undefined)\n",
       "args                    =>   (omitted Tuple{DataFrame,Array{Float64,1}} of length 2)\n",
       "report                  =>   (undefined)\n",
       "rows                    =>   (undefined)\n",
       "\n"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "constant = machine(constant_model, X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training and making a new prediction:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "┌ Info: Training \u001b[0m\u001b[1mMachine{ConstantRegressor{Float6…} @ 4…98\u001b[22m.\n",
      "└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:93\n",
      "┌ Info: Fitted a constant probability distribution, Distributions.Normal{Float64}(μ=181246.16879293424, σ=76842.2276854426).\n",
      "└ @ MLJ.Constant /Users/anthony/Dropbox/Julia7/MLJ/src/builtins/Constant.jl:46\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "4-element Array{Distributions.Normal{Float64},1}:\n",
       " Distributions.Normal{Float64}(μ=181246.16879293424, σ=76842.2276854426)\n",
       " Distributions.Normal{Float64}(μ=181246.16879293424, σ=76842.2276854426)\n",
       " Distributions.Normal{Float64}(μ=181246.16879293424, σ=76842.2276854426)\n",
       " Distributions.Normal{Float64}(μ=181246.16879293424, σ=76842.2276854426)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fit!(constant, rows=train)\n",
    "yhat = predict(constant, X[test,:]);\n",
    "yhat[1:4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model we chose can provide probabilistic predictions and so does so by default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Float64,1}:\n",
       " 181246.16879293424\n",
       " 181246.16879293424\n",
       " 181246.16879293424\n",
       " 181246.16879293424"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean.(yhat)[1:4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4060172606116789"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rmsl(yhat, y[test]) # applies root-mean-square-log loss to prediction means"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or, we could have skipped the train/test definitions and evaluated one line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "┌ Info: Evaluating using a holdout set. \n",
      "│ fraction_train=0.7 \n",
      "│ shuffle=false \n",
      "│ measure=MLJ.rmsl \n",
      "│ operation=MLJBase.predict \n",
      "│ Resampling from all rows. \n",
      "└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/resampling.jl:87\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.4095126492179508"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "evaluate!(constant, resampling=Holdout(fraction_train=0.7), measure=rmsl)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Something fancier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's define a composite model type that:\n",
    "\n",
    "- one-hot encodes the inputs\n",
    "- log transforms the target\n",
    "- fits specified K-nearest neighbour and ridge regressor models to the data\n",
    "- computes a weighted average of individual model predictions\n",
    "- inverse transforms (exponentiates) the blended predictions\n",
    "\n",
    "MLJ will eventually have macros for quickly building this kind of composite model (and more sophisticated model stacks). Nevertheless, building the model with our bare hands is easier than in other popular frameworks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the new model struture, with other models as hyperparameters:\n",
    "mutable struct KNNRidgeBlend <:Deterministic{Node}\n",
    "    knn_model\n",
    "    ridge_model\n",
    "    weight # between 0 and 1\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# model fitting instructions:\n",
    "function MLJ.fit(model::KNNRidgeBlend, X, y)\n",
    "    \n",
    "    # source nodes of the \"learning network\" being wrappped:    \n",
    "    \n",
    "    Xs = source(X) \n",
    "    ys = source(y)\n",
    "\n",
    "    hot = machine(OneHotEncoder(), Xs)\n",
    "\n",
    "    # W, z, zhat and yhat are nodes in the network:\n",
    "    \n",
    "    W = transform(hot, Xs) # one-hot encode the input\n",
    "    z = log(ys) # transform the target\n",
    "    \n",
    "    ridge_model = model.ridge_model\n",
    "    knn_model = model.knn_model\n",
    "\n",
    "    ridge = machine(ridge_model, W, z) \n",
    "    knn = machine(knn_model, W, z)\n",
    "\n",
    "    zhat = model.weight*predict(ridge, W) + (1 - model.weight)*predict(knn, W) # average the predictions of KNN and ridge models\n",
    "    yhat = exp(zhat) # inverse the target transformation\n",
    "    \n",
    "    fit!(yhat, verbosity=0)\n",
    "    \n",
    "    return yhat\n",
    "end\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now instantiate a blended model and evaluate it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "┌ Info: Evaluating using a holdout set. \n",
      "│ fraction_train=0.7 \n",
      "│ shuffle=false \n",
      "│ measure=MLJ.rmsl \n",
      "│ operation=MLJBase.predict \n",
      "│ Resampling from all rows. \n",
      "└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/resampling.jl:87\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.13108966715891057"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "knn_model = KNNRegressor(K=2)\n",
    "ridge_model = RidgeRegressor(lambda=0.1)\n",
    "knn_ridge_blend_model = KNNRidgeBlend(knn_model, ridge_model, 0.90)\n",
    "knn_ridge_blend = machine(knn_ridge_blend_model, X, y)\n",
    "er = evaluate!(knn_ridge_blend, resampling=Holdout(fraction_train=0.7), measure=rmsl) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can inspect the nested hyperparameters of our composite model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(knn_model = (target_type = Float64, K = 2, metric = MLJ.KNN.euclidean, kernel = MLJ.KNN.reciprocal), ridge_model = (target_type = Float64, lambda = 0.1), weight = 0.9)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "params(knn_ridge_blend_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For our next trick, we wrap the model in a tuning strategy, producing a new tuned model. \n",
    "\n",
    "The form of `nested_ranges` below matches the pattern above (with parameters not being tuned omitted):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "# \u001b[0m\u001b[1mDeterministicTunedModel @ 9…60\u001b[22m: \n",
       "model                   =>   \u001b[0m\u001b[1mKNNRidgeBlend @ 4…38\u001b[22m\n",
       "tuning                  =>   \u001b[0m\u001b[1mGrid @ 3…17\u001b[22m\n",
       "resampling              =>   \u001b[0m\u001b[1mCV @ 1…08\u001b[22m\n",
       "measure                 =>   rmsl (generic function with 2 methods)\n",
       "operation               =>   predict (generic function with 19 methods)\n",
       "nested_ranges           =>   (omitted NamedTuple{(:knn_model, :ridge_model),Tuple{NamedTuple{(:K,),Tuple{MLJ.NumericRange{Int64,Symbol}}},NamedTuple{(:lambda,),Tuple{MLJ.NumericRange{Float64,Symbol}}}}})\n",
       "full_report             =>   true\n",
       "\n"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "K_range = range(knn_model, :K, lower=2, upper=100, scale=:log10)\n",
    "lambda_range = range(ridge_model, :lambda, lower= 1e-6, upper=10, scale = :log10)\n",
    "\n",
    "nested_ranges = (knn_model = (K = K_range,), ridge_model = (lambda = lambda_range,))\n",
    "\n",
    "tuning = Grid(resolution=12)\n",
    "resampling = CV(nfolds=6)\n",
    "\n",
    "tuned_blended_model = TunedModel(model=knn_ridge_blend_model, \n",
    "    tuning=tuning, resampling=resampling, nested_ranges=nested_ranges, measure=rmsl)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fitting a corresponding machine searches for the model with optimal performance, as determined by the specified tuning/resampling strategy, and trains the best model on all specifed data (in this case on `train`). So, predictions of the fitted machine are predictions using optimal hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "┌ Info: Training \u001b[0m\u001b[1mMachine{MLJ.DeterministicTunedMo…} @ 1…38\u001b[22m.\n",
      "└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/machines.jl:93\n",
      "\u001b[33mSearching a 144-point grid for best model: 100%[=========================] Time: 0:01:34\u001b[39m\n",
      "┌ Info: Training best model on all supplied data.\n",
      "└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/tuning.jl:142\n"
     ]
    }
   ],
   "source": [
    "tuned_blended = machine(tuned_blended_model, X, y)\n",
    "fit!(tuned_blended, rows=train);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(best_model = \u001b[0m\u001b[1mKNNRidgeBlend @ 1…63\u001b[22m,)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fitted_params(tuned_blended)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(best_model.knn_model).K = 17\n",
      "(best_model.ridge_model).lambda = 2.3101297000831598\n"
     ]
    }
   ],
   "source": [
    "best_model = ans.best_model\n",
    "@show best_model.knn_model.K;\n",
    "@show best_model.ridge_model.lambda;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "┌ Info: Evaluating using a holdout set. \n",
      "│ fraction_train=0.7 \n",
      "│ shuffle=false \n",
      "│ measure=MLJ.rmsl \n",
      "│ operation=MLJBase.predict \n",
      "│ Resampling from all rows. \n",
      "└ @ MLJ /Users/anthony/Dropbox/Julia7/MLJ/src/resampling.jl:87\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0.13078937184214492"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "best = machine(best_model, X, y)\n",
    "evaluate!(best, resampling=Holdout(fraction_train=0.7), measure=rmsl)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 1.0.2",
   "language": "julia",
   "name": "julia-1.0"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "1.0.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
